1. Introduction


It  is  well  known  that  testing  affects  what  is  taught  in  the  schools.  As  nationwide  tests  of 
math  skills  or  reading  comprehension  become  established,  they  become  the  standards  by  which 
school  systems,  teachers,  and  students  are  judged.  They  unconsciously  dictate  what  students 
should  learn,  and  so  education  in  the  schools  begins  to  point  toward  teaching  the  skills  necessary 
to do well on these tests. The most flagrant example of this effect is the courses to prepare
students  to  take  the  Scholastic  Aptitude  Test  for  college  entrance,  but  the  effect  is  far  more 
pervasive  in  subtle  ways  throughout  the  schools. 

This  phenomenon  has  potentially  beneficial  side  effects  in  that  the  tests  form  uniform 
standards  by  which  we  can  compare  different  schools,  teachers,  and  students.  They  establish  for 
all  to  see,  as  it  were,  what  is  expected  of  students,  and  hence  point  the  nation’s  students  and 
teachers  toward  specific  objectives  that  can  be  debated,  quantified,  and  applied  equally  to  all. 

But  there  are  insidious  side  effects  that  are  less  well  known.  First,  there  is  the  tendency  for 
testing  to  drive  teaching  down  to  the  level  of  our  testing  technology  -  away  from  learning  and 
reasoning skills toward more easily measurable skills that can be tested by multiple-choice items.
Second, testing encourages students to adopt memorization rather than understanding as their
goal: knowledge is learned in a form in which it can be recalled rather than in a form in which it can be
used in real-life tasks. Finally, there is a kind of test-taking mentality that takes over and helps
turn  many  students  against  school  and  learning  more  generally.  Each  of  these  issues  needs  some 
amplification. 

Education in service of what can be measured. There is a growing disparity between
what  we  think  we  should  be  teaching  students  and  what  we  actually  are  teaching.  This  disparity 
is  reflected  in  the  concern  in  the  education  community  about  teaching  critical  thinking  and 
metacognitive  skills  (e.g.,  Glaser,  1984).  I  suspect  the  disparity  arises  mainly  in  our  escalation 
of  expectations  for  the  schools.  Cuban  (1984)  reports  that  in  historical  terms  there  was  more 
emphasis  on  rote  skills  and  memorization  in  former  years  than  now.  But  machines  are  taking 
over  the  low-level  jobs  in  our  society,  leaving  more  and  more  demand  for  the  kinds  of  reasoning 
that  only  humans  are  capable  of. 

However, reasoning and metacognitive skills are the most difficult skills to measure. They
include  the  skills  of  planning,  monitoring  your  processing  during  a  task,  checking  what  you  have 
done,  estimating  what  a  reasonable  answer  might  be,  actively  considering  possible  alternative 
courses  of  action,  separating  relevant  information  from  irrelevant  information,  choosing 
problems  that  are  useful  to  work  on,  asking  good  questions,  etc.  These  are  skills  that  current 
tests  for  the  most  part  do  not  measure,  nor  is  it  easy  to  see  how  they  could  be  measured  within  a 
single-item, multiple-choice format.

But  these  are  the  kinds  of  skills  that  have  the  most  payoff  in  teaching,  as  evidenced  by  the 
success  of  Reciprocal  Teaching  and  other  "cognitive  apprenticeship"  methods  (Collins,  Brown, 
and  Newman,  in  press;  Palincsar  and  Brown,  1984;  Schoenfeld,  1983,  1985).  For  example, 
Palincsar and Brown (1984) produced huge gains in students’ reading comprehension using their
Reciprocal  Teaching  method  that  taught  students  (1)  to  formulate  questions  about  texts,  (2)  to 
summarize  texts,  (3)  to  clarify  difficulties  with  texts,  and  (4)  to  make  predictions  about  what  is 
coming  next  in  texts.  These  skills  are  critical  to  the  ability  to  monitor  one's  reading 
comprehension  (Collins,  Brown,  and  Newman,  in  press),  but  they  are  not  the  kinds  of  skills  that 
are  easily  measured.  To  the  degree  that  testing  technology  drives  education,  it  will  drive 
teaching  away  from  these  high-order  thinking  skills  to  the  lower-order  skills  that  can  be 
measured  easily. 

Moreover,  to  the  degree  we  attempt  to  develop  tests  that  are  truly  diagnostic,  we  may 
exacerbate  the  problem  even  more.  There  have  been  some  great  successes  in  our  ability  to 
identify  systematic  student  errors  in  arithmetic  and  algebra  (Brown  and  Burton,  1978;  Brown 
and  VanLehn,  1980;  Matz,  1982;  Sleeman,  1982;  Tatsuoka,  this  volume).  One  notion  afoot  is 
that  since  we  can  diagnose  the  precise  errors  students  are  making,  we  can  then  teach  directly  to 
counter  these  errors.  Such  diagnosis  might  indeed  be  useful  in  a  system  where  diagnosis  and 
remediation  are  tightly  coupled,  as  for  example  in  the  LISP  tutor  described  by  Anderson  (this 
volume). But if diagnosis becomes an objective in nationwide tests, then this will drive
education to the lower-order skills for which we can do the kind of fine diagnosis possible for
arithmetic. Such an outcome would be truly disastrous. It is precisely the kinds of skills for
which we can do fine diagnosis that are becoming obsolete in the computational world of today.

Conventional testing promotes memorization rather than understanding. Tests are
one  of  the  great  incentives  for  students  to  study.  But  they  lead  students  into  the  worst  kinds  of 
study  strategies.  When  students  learn  information  for  tests,  they  are  developing  strategies  and 
memorizing  information  in  forms  that  are  of  little  or  no  use  for  real  world  problem  solving. 

For example, much of students’ studying involves memorizing information or procedures
that  they  think  they  will  be  asked  on  a  test  (Schoenfeld,  in  press).  This  leads  to  the  problem  of 
"inert"  knowledge  (Collins,  Brown,  and  Newman,  in  press).  Facts  and  procedures  are  learned  in 
isolation,  apart  from  the  different  contexts  in  which  they  might  be  used.  As  we  argued  in  the 
earlier  paper,  learning  of  information  and  procedures  needs  to  be  "situated"  in  multiple  contexts 
reflecting  its  different  uses  in  real  world  contexts.  Otherwise  students  are  not  likely  to  see  how 
the  knowledge  they  are  getting  can  be  applied.  The  "what"  of  knowledge  is  only  a  third  of  what 
needs  to  be  learned;  we  also  need  to  know  the  "when"  and  "how"  it  applies  in  different  contexts, 
else  we  will  find  that  students  cannot  transfer  what  they  have  learned  to  new,  but  relevant, 
contexts. 

A  related  problem  was  identified  by  Schoenfeld  (1985)  for  tests  covering  course  material 
among  math  students.  They  develop  strategies  for  what  to  do,  based  on  idiosyncrasies  of  the 
course  and  test  problems.  For  example,  if  an  answer  doesn't  come  out  to  an  even  integer,  they 
think  it  is  wrong.  And  the  methods  they  consider  using  are  governed  by  what  material  the  test 
covers (e.g., addition of fractions, algebra word problems), rather than drawn from among all the methods
they  have  learned  up  to  then  in  their  mathematics  courses.  Thus  they  evolve  solution  methods 
that  are  counterproductive  for  solving  real  world  problems. 

In  summary,  when  testing  becomes  the  raison  d’etre  for  learning,  students  develop 
memorization  strategies  that  lead  to  decontextualized  knowledge  that  cannot  be  applied  later  in 
relevant  contexts.  Furthermore,  they  learn  problem  solving  strategies  that  are  counterproductive 
for  real  world  problems. 

Conventional  testing  fosters  a  mentality  that  turns  some  students  against  learning. 
Even  more  subtle  and  insidious  than  the  previous  two  side  effects  is  what  happens  when  poorer 
students see rewards and success going to the students who do well on tests, and disapproval to
themselves (cf. Dweck, 1986). They come to regard learning as synonymous with doing well on
tests,  and  since  they  do  not  do  well  on  tests,  they  do  not  want  to  compete  in  what  they  perceive 
will  be  a  losing  battle.  In  consequence,  they  come  to  regard  education  as  irrelevant  to  their 
interests  in  life  (e.g.,  becoming  an  athlete  or  beautician),  and  boring  as  compared  with  say  their 
social  life,  athletics,  etc.  This  is  not  to  say  that  removing  testing  from  the  schools  would 
completely  alleviate  the  problem  of  students’  loss  of  motivation  to  learn  anything  in  school, 
since teachers’ expectations undoubtedly contribute to their negative self-image as learners
(Rosenthal and Jacobson, 1968). But tests are a major contributing factor, since they are the
means  by  which  students  are  publicly  labeled  as  inferior. 


2. Desiderata for a New Kind of Testing


There are five desiderata that I view as critical for a new, more benign learning and testing
environment.  They  may  not  all  be  attainable,  but  they  serve  as  goals  to  strive  toward  in 
redesigning  testing. 

1. Tests should emphasize learning and thinking. A test in any domain should
emphasize higher-order thinking skills in that domain: in particular, problem
solving strategies (i.e., heuristics), self-regulatory or monitoring strategies, and
learning strategies (Collins, Brown & Newman, in press). Dynamic testing
(Campione  &  Brown,  this  volume)  goes  some  way  toward  centering  testing  on  just 
such  issues.  These  higher-order  skills  are  what  we  want  students  to  learn,  and  so 
tests  must  focus  on  them. 

2.  Tests  should  require  generation  as  well  as  selection.  Most  tasks  in  the  real  world 
require  planning  and  executing,  but  multiple  choice  tests  only  require  choosing  the 
best  answer.  Hence  they  cannot  in  fact  measure  critical  aspects  of  thinking.  So  it 
is  important  that  tests  require  generation  of  ideas  by  students  (Frederiksen,  1984). 

3. Tests should be integral to learning. As tests are presently construed, students stop learning
when  they  take  a  test.  Occasionally  they  may  learn  something  going  over  a  test, 
but  this  happens  only  rarely.  The  major  positive  effects  of  tests  on  learning  then 
are  the  motivational  effects,  and  these  occur  mainly  with  teacher-generated  tests. 
Ideally  tests  should  not  be  intrusive  to  learning,  but  rather  integral  to  it.  This  is 
perhaps  the  most  difficult  of  the  five  desiderata  to  achieve. 

4.  Tests  should  serve  multiple  purposes.  I  have  alluded  to  some  of  the  purposes 
served  by  tests,  and  other  researchers  (cf.  Linn,  1986)  have  tried  to  enumerate  such 
purposes.  Let  me  list  some  of  those  purposes  lest  they  be  overlooked:  a) 
motivating  students  to  study  and  directing  that  study  to  certain  topics  or  issues,  b) 
diagnosing  what  difficulties  students  are  having  and  selecting  what  they  should 
study  next,  c)  placing  students  in  classes,  grades,  schools,  and  jobs,  d)  reporting  to 
students,  teachers,  and  parents  on  the  progress  a  student  has  made,  and  e) 
evaluating  how  well  a  teacher,  school,  or  school  system  is  doing  vis  a  vis  other 
teachers,  schools,  and  systems.  There  may  be  other  purposes  for  testing  but  these 
are  the  major  purposes.  In  some  sense  testing  must  serve  all  of  these  purposes  in 
one  way  or  another. 

5. Tests should be valid with respect to all their purposes. Test makers have worried
a  great  deal  about  reliability  and  validity  of  tests.  But  for  the  most  part  their 
concerns  about  validity  have  only  been  for  content  validity  and  predictive  validity 
of  the  tests  with  respect  to  future  schooling.  We  need  to  be  much  more  concerned 
about  the  validity  of  tests  with  respect  to  the  other  purposes  of  testing.  For 
example,  do  the  tests  really  measure  the  effectiveness  of  teaching?  Do  they 
motivate  students  to  learn  the  kinds  of  knowledge  and  higher-order  skills  we  want 
children  to  learn?  As  I  said  in  the  Introduction,  there  are  reasons  to  doubt  that  they 
do. 

Furthermore,  as  pointed  out  earlier,  when  tests  become  more  critical  in  making 
decisions,  teachers  and  students  direct  their  teaching  and  learning  to  do  well  on  the 
test.  Then  tests  lose  validity.  That  is,  to  the  degree  test  validity  depends  on  factors 
that  coincide  with,  but  are  not  the  same  as,  the  skills  required  in  the  future  school 
or job, preparing for the test reduces the predictive validity of the test. Thus
what starts out as a highly valid test may lose validity as it becomes more visible or
decisive in making selections.

Suppose  for  example  that  aptitude  for  college  is  measured  with  a  vocabulary  test. 
Normally a vocabulary test might be a very good predictor of how well someone
will  do  in  college,  because  people  who  read  and  study  acquire  a  large  vocabulary  in 
the  process.  But  then  suppose  it  becomes  known  that  students  will  be  selected  for 
college  or  that  teachers  will  be  evaluated  for  effectiveness  on  the  basis  of  such  a 
test. Then it behooves students or teachers to concentrate on vocabulary, which
is  a  relatively  easy  thing  to  learn  as  compared  to  say  an  understanding  of  algebra  or 
literature.  When  that  happens,  the  vocabulary  test  ceases  to  be  a  good  predictor  of 
how  someone  will  do  in  college.  In  fact,  better  students  are  likely  to  regard 
learning  vocabulary  as  cheating  while  lesser  students  will  regard  it  as  necessary  for 
survival,  so  the  test  might  even  become  negatively  predictive  if  enough  attention  is 
focused  on  it.  Furthermore,  students  will  concentrate  their  energies  not  on  learning 
what  is  most  valuable  for  future  life,  but  rather  on  what  is  at  best  a  superficial 
index  of  learning  (i.e.  vocabulary).  This  is  what  Frederiksen  (1984)  refers  to  as  the 
"real  test  bias". 

In  this  example,  it  is  possible  to  substitute  any  number  of  things  for  vocabulary. 
For example, Raven's matrices or analogy problems are probably quite good
measures of general problem solving ability, unless students practice on such
items. If they do, they can learn (or be taught) the patterns by which such items are
constructed  and  so  they  do  not  then  have  to  figure  out  nearly  as  much  when  they 
come  to  take  the  test.  So  again  such  tests  will  lose  their  predictive  validity  if  they 
become  the  focus  of  study.  The  only  way  to  prevent  such  an  occurrence  is  to  have 
tests  that  reflect  all  the  knowledge,  skills,  and  strategies  necessary  for  success  in 
college  or  whatever  outcome  the  tests  are  designed  to  predict. 
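
The loss of predictive validity described in the vocabulary example can be made concrete with a small worked calculation. The sketch below is illustrative only: the five students, their scores, and the amounts of "cramming" are hypothetical numbers invented for the example, not data from any study, and the Pearson correlation is used merely as a simple stand-in for predictive validity.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: five students, ordered from weakest to strongest.
college_outcome = [1, 2, 3, 4, 5]   # what the test is meant to predict
natural_vocab = [1, 2, 3, 4, 5]     # vocabulary acquired through reading and study

# Extra test points gained by drilling vocabulary (assumed numbers):
# weaker students drill more, stronger students drill little or not at all.
mild_cramming = [2, 2, 1, 0, 0]
heavy_cramming = [6, 5, 3, 1, 0]

def scores_with_cramming(cramming):
    """Observed test score = naturally acquired vocabulary + drilled vocabulary."""
    return [v + c for v, c in zip(natural_vocab, cramming)]

validity_before = pearson(natural_vocab, college_outcome)
validity_mild = pearson(scores_with_cramming(mild_cramming), college_outcome)
validity_heavy = pearson(scores_with_cramming(heavy_cramming), college_outcome)
print(validity_before, validity_mild, validity_heavy)
```

Under these assumed numbers the correlation between test score and college outcome falls from 1.0 to roughly 0.89 with mild drilling, and becomes negative (about -0.95) when the weaker students drill heavily, which is the pattern the vocabulary example anticipates.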


3. Two Scenarios for a New Testing and Learning Environment


Testing  is  undoubtedly  necessary  in  a  complex  society  where  we  need  to  make  decisions 
about  who  should  go  to  what  schools,  who  should  do  what  jobs,  and  what  should  be  taught  to 
different students. The fundamental question about testing then, as I see it, is: how can we
construct an educational system that embodies testing in a form that sustains its necessary
functions and at the same time alleviates the problems that the current system has generated?

The three papers I have been asked to comment upon, by Anderson, Frederiksen and White,
and Kieras, point the way to a possible answer. In the rest of the paper I want to elaborate at
some  length  how  it  is  possible  to  take  the  ideas  implicit  in  intelligent  tutoring  systems,  and 
educational  computer  systems  more  generally,  to  construct  a  new  kind  of  learning  and  testing 
environment. 

There  are  two  scenarios  I  can  envision  for  exploiting  the  potential  of  computer  systems  for 
testing. The first, more conservative scenario is partly depicted in the papers by Frederiksen and
White  and  by  Kieras.  It  is  summarized  nicely  in  Frederiksen  and  White’s  title  "Intelligent  Tutors 
as  Intelligent  Testers"  and  it  goes  some  way  toward  addressing  the  desiderata  outlined  in  the 
previous  section.  The  second,  more  radical  scenario  envisions  a  completely  integrated  learning 
and  testing  environment. 

Intelligent  tutors  as  intelligent  testers.  In  this  first  scenario  intelligent  tutoring  systems 
become  the  devices  for  administering  tests  to  students.  The  tests  would  be  problem  solving  tests, 
where  problems  differing  in  difficulty  are  given  to  students.  The  test  would  start  with  easier 
problems,  and,  depending  on  how  well  the  student  does,  the  subsequent  problems  will  be  easier 
or  more  difficult,  as  with  adaptive  testing. 
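
The adaptive selection of problems just described can be sketched as a simple "staircase" procedure. This is a minimal illustration under stated assumptions, not the method of any of the systems discussed: the five difficulty levels, the one-level step, and the `answer_correctly` callback (which stands in for a student attempting a problem) are all hypothetical.

```python
from typing import Callable, List, Tuple

def run_adaptive_test(
    answer_correctly: Callable[[int], bool],
    num_problems: int = 6,
    min_level: int = 1,
    max_level: int = 5,
) -> List[Tuple[int, bool]]:
    """Give problems one at a time, starting easy and adjusting difficulty.

    answer_correctly(level) is a hypothetical stand-in for the student
    attempting a problem at the given difficulty level.
    """
    level = min_level  # the test starts with the easier problems
    history = []
    for _ in range(num_problems):
        correct = answer_correctly(level)
        history.append((level, correct))
        # A harder problem follows a success, an easier one follows a failure.
        step = 1 if correct else -1
        level = max(min_level, min(max_level, level + step))
    return history

# Hypothetical student who can solve problems up to difficulty level 3.
history = run_adaptive_test(lambda level: level <= 3)
print(history)
```

For this hypothetical student the sequence of difficulty levels climbs to level 4 and then oscillates between levels 3 and 4, which locates the boundary of what the student can currently do.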

As  they  solve  problems,  students  can  be  given  cognitive  feedback  on  how  best  to  solve 
these  kinds  of  problems.  That  is,  the  full  capability  of  the  tutoring  system  to  teach  the  students 
can  be  employed  as  part  of  the  testing  procedure.  The  test  then  will  measure  not  simply  their 
prior  ability  to  perform  the  kind  of  tasks  given  by  the  system,  but  in  addition  it  will  measure  how 
well  they  can  learn  to  perform  these  tasks  given  precisely  specified  cognitive  feedback  and 
advice. This gives the intelligent testing system the same kind of capabilities for measurement
that  Campione  and  Brown  (this  volume)  have  developed  in  their  work  on  dynamic  assessment. 

Intelligent  tutoring  systems  require  the  student  to  generate  entire  sequences  of  actions  that 
lead  to  solutions,  whether  the  problems  are  programming  problems  as  in  Anderson’s  LISP  Tutor, 
or  electricity  problems  as  in  Frederiksen  and  White’s  Quest  tutor,  or  operational  problems  as  in 
Kieras’s  phaser  control  system.  While  the  responses  allowed  by  tutoring  systems  are  not  open 
ended  (i.e.,  there  is  usually  a  restricted  class  of  inputs  that  the  system  can  process),  they  are  not 
single-item, multiple-choice response formats. Thus, the responses required by intelligent
tutoring  systems  are  generative  in  the  sense  implied  in  the  desiderata  listed  earlier,  but  at  the 
same  time  they  are  precise  enough  to  be  evaluated  according  to  well-defined  criteria  necessary 
for  constructing  tests. 

Scoring in such a system can be based on the same kinds of measures now used to evaluate
problem  solving:  percent  correct  in  solving  problems,  average  time  to  solve  problems,  number  of 
incorrect  vs.  correct  steps  taken  in  attempting  to  solve  a  problem,  etc.  But  to  the  degree  a 
system has a characterization of what expert performance requires, as in Anderson’s LISP Tutor
or  Frederiksen  and  White's  Quest,  it  is  possible  to  evaluate  students  more  directly.  In  the  case  of 
the  LISP  Tutor,  the  system  has  an  idealized  problem  solving  model  consisting  of  some  325 
production  rules,  which  represent  its  strategies  for  solving  programming  problems.  As  students 
work  problems  the  system  can  evaluate  the  degree  to  which  each  of  these  productions  is  used 
where appropriate; we then have a measure of how much of the expert model has been acquired.
Similarly,  it  might  also  be  possible  to  assess  how  well  the  students  have  learned  to  suppress 
those  productions  in  the  system  representing  particular  misconceptions.  For  the  Quest  System, 
the  student’s  level  of  performance  can  be  evaluated  in  terms  of  how  far  along  the  progression  of 
more  and  more  sophisticated  models  a  student  has  advanced.  In  either  case  it  should  be  possible 
to  measure  both  the  student’s  current  level  of  understanding  of  the  domain,  and  the  rate  at  which 
he  or  she  is  learning  with  the  tutoring  system. 
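
The scoring measures mentioned above can be sketched in a few lines. The sketch is hypothetical throughout: the `Attempt` record, the rule names `rule_a` and `rule_b`, and the sample data are invented for illustration and do not reflect the internal representation of the LISP Tutor or Quest. It computes percent correct, mean solution time, the fraction of opportunities on which each expert rule was used where appropriate, and a crude learning gain between early and late problems.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Attempt:
    solved: bool               # did the student solve the problem?
    seconds: float             # time spent on the problem
    rule_use: Dict[str, bool]  # applicable rule -> was it used? (hypothetical)

def percent_correct(attempts: List[Attempt]) -> float:
    return 100.0 * sum(a.solved for a in attempts) / len(attempts)

def mean_seconds(attempts: List[Attempt]) -> float:
    return sum(a.seconds for a in attempts) / len(attempts)

def rule_coverage(attempts: List[Attempt]) -> Dict[str, float]:
    """Fraction of opportunities on which each rule was used where appropriate."""
    used: Dict[str, int] = {}
    seen: Dict[str, int] = {}
    for a in attempts:
        for rule, applied in a.rule_use.items():
            seen[rule] = seen.get(rule, 0) + 1
            used[rule] = used.get(rule, 0) + int(applied)
    return {rule: used[rule] / seen[rule] for rule in seen}

def fraction_acquired(attempts: List[Attempt]) -> float:
    """Average coverage over all rules that had a chance to apply."""
    coverage = rule_coverage(attempts)
    return sum(coverage.values()) / len(coverage)

# Hypothetical record of four problems worked by one student.
attempts = [
    Attempt(True, 120.0, {"rule_a": True, "rule_b": False}),
    Attempt(False, 300.0, {"rule_a": True, "rule_b": False}),
    Attempt(True, 90.0, {"rule_a": True, "rule_b": True}),
    Attempt(True, 90.0, {"rule_b": True}),
]
early, late = attempts[:2], attempts[2:]
learning_gain = fraction_acquired(late) - fraction_acquired(early)
print(percent_correct(attempts), mean_seconds(attempts),
      rule_coverage(attempts), learning_gain)
```

Comparing early and late problems is only one possible proxy for the rate of learning; a fuller treatment would fit a learning curve to the whole record.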

Because  the  systems  can  analyze  sequences  of  actions,  they  have  a  capability  to  measure 
strategic  skills  as  well  as  domain  skills.  For  example,  Anderson,  Boyle,  and  Reiser’s  (1985) 
Geometry  Tutor  allows  students  to  work  forward  from  the  givens  or  backwards  from  the 
statement to be proved in constructing geometry proofs. One good strategy to learn is first to
work forward from the givens a little way to see their implications and then to work backwards
from the statement to be proved in order to close the gap. A good "metacognitive strategy" is
that  when  you  are  stuck  working  forward  or  backwards  (which  might  be  indicated  by  a  long 
pause),  switch  to  working  the  other  way.  In  a  system  such  as  the  Geometry  Tutor,  it  would  be 
possible  for  the  system  to  analyze  sequences  of  actions  (and  pauses)  by  the  students,  to  make 
suggestions  as  to  what  are  good  strategies,  and  to  evaluate  how  well  students  learn  to  approach 
problems  strategically  (Collins  &  Brown,  in  press). 
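
As a rough illustration of how sequences of actions and pauses might be analyzed, the sketch below counts how often a long pause is followed by a switch in working direction. It is not a description of the Geometry Tutor: the step log format, the 30-second threshold for a "long pause", and the sample data are assumptions made for the example.

```python
from typing import List, Tuple

# Each step is (timestamp in seconds, direction), where direction is
# "forward" (from the givens) or "backward" (from the statement to be proved).
Step = Tuple[float, str]

def switches_after_long_pauses(
    steps: List[Step], pause_threshold: float = 30.0
) -> Tuple[int, int]:
    """Return (long_pauses, switches).

    long_pauses counts gaps between consecutive steps of at least
    pause_threshold seconds; switches counts how many of those gaps were
    followed by a step in the other direction.
    """
    long_pauses = 0
    switches = 0
    for (t_prev, dir_prev), (t_next, dir_next) in zip(steps, steps[1:]):
        if t_next - t_prev >= pause_threshold:
            long_pauses += 1
            if dir_next != dir_prev:
                switches += 1
    return long_pauses, switches

# Hypothetical log of one student's proof attempt.
steps = [
    (0.0, "forward"),
    (10.0, "forward"),
    (55.0, "backward"),   # long pause, then a switch of direction
    (65.0, "backward"),
    (110.0, "backward"),  # long pause, but no switch
    (150.0, "forward"),   # long pause, then a switch of direction
]
long_pauses, switches = switches_after_long_pauses(steps)
print(long_pauses, switches)
```

The ratio of switches to long pauses gives one crude index of whether the student has adopted the strategy of changing direction when stuck; tracking it across sessions would indicate whether the strategy is being learned.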

A  serious  limitation  of  today's  intelligent  tutoring  systems  is  that  they  only  exist  in  the 
domains  of  math  and  science.  This  is  because  computational  techniques  provide  the  most 
leverage  in  these  domains.  One  question  then  is  whether  computer  systems  have  any  role  to  play 
as  testing  systems  in  domains  such  as  reading,  writing,  and  history.  There  are  in  fact  less-than- 
intelligent  computer-based  teaching  systems  in  these  three  domains  that  might  be  useful. 

For example, in the domain of reading, the IRIS system (developed by WICAT and described
in  Collins,  1986)  presents  passages  to  students  and  then  asks  questions  about  the  passages,  much 
like  a  reading  comprehension  test.  But  it  is  an  instructional  system,  so  that  students  receive 
cognitive  feedback  on  what  they  do  that  should  help  them  learn  to  read  better.  My  reservation 
about  this  particular  system  is  that  there  is  not  as  much  instruction  on  how  to  make  inferences  or 
monitor one’s comprehension as there is in the best comprehension instruction (e.g., Palincsar &
Brown,  1984).  But  the  structure  is  potentially  there  to  do  so. 

The  most  relevant  computer  system  for  testing  writing  is  Writer's  Workbench  (McDonald, 
Frase,  Gingrich,  &  Keenan,  1982),  but  it  is  more  an  advisory  system  than  a  teaching  system.  It 
can  analyze  texts  in  terms  of  spelling,  word  usage,  and  grammar;  it  can  even  evaluate  overusage 
of  the  passive  voice,  frequency  of  empty  phrases  like  "there  are",  and  the  readability  of  the  text 
by  standard  readability  measures.  But  essentially  it  is  only  evaluating  surface  features  of  the 
text: it cannot evaluate clarity, interest, persuasiveness, or memorability, which are the critical
aspects  that  a  good  text  must  possess  (Collins  &  Gentner,  1980).  Designing  testing  around  a 
system  that  only  evaluated  surface  features  would  lead  the  teaching  of  writing  in  the  wrong 
direction.  But  that  is  all  that  computer-based  systems  will  be  capable  of  evaluating  in  the 
foreseeable  future. 
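
To make concrete what "surface features" means here, the following sketch counts a few of them in a short text. It is a toy illustration and not the Writer's Workbench implementation: the choice of features, the regular expressions, and the sample text are assumptions made for the example.

```python
import re

def surface_features(text: str) -> dict:
    """Count a few surface features of a text (illustrative only)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Empty openers such as "there is" or "there are".
    empty_phrases = re.findall(
        r"\bthere\s+(?:is|are|was|were)\b", text, flags=re.IGNORECASE
    )
    return {
        "sentences": len(sentences),
        "words": len(words),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "empty_phrases": len(empty_phrases),
    }

sample = "There are many tests. Students take them. There is little learning."
features = surface_features(sample)
print(features)
```

Nothing in such counts bears on whether the text is clear, interesting, persuasive, or memorable, which is exactly the limitation noted above.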

The most ingenious computer-based teaching system for history is Geography Search
developed by Tom Snyder (Kelman et al., 1981). It is a historical simulation of the time after
Columbus  discovered  America,  and  explorers  sailed  to  the  New  World  to  bring  back  its  wealth 
and  resources.  In  the  simulation  students  have  to  purchase  supplies  for  their  trip  to  the  New 
World,  and  navigate  using  sextant  and  compass.  They  must  plan  their  voyage  as  they  go, 
depending upon what they find and how many supplies they have left for the voyage home.
Historical simulations such as Geography Search or the Civil War Game by Avalon Hill give
students  an  understanding  of  the  reasons  why  events  take  place  in  a  historical  context.  While 
Geography  Search  does  not  do  so,  it  would  certainly  be  possible  to  provide  a  computer  coach  to 
advise  students  as  they  engage  in  these  simulations.  In  such  a  scenario,  it  would  be  possible  to 
evaluate how well students learn to plan and solve problems in historical contexts. This is not
what  we  usually  test  about  students'  understanding  of  history,  but  it  is  perhaps  an  equally  valid 
kind  of  historical  understanding.  Moreover,  most  of  the  important  concerns  of  history,  such  as 
the  development  of  the  Constitution  or  the  settling  of  the  American  frontier,  can  be  turned  into 
historical  simulations. 

In  summary,  the  plan  to  develop  intelligent  tutors  as  intelligent  testers  is  feasible  in  much  of 
the current school curriculum. It has several benefits: (a) testing would be focused on students’
problem  solving  and  planning  skills,  (b)  their  ability  to  learn  in  a  domain  as  well  as  their  prior 
knowledge  could  be  tested,  and  (c)  the  tests  could  be  adaptive  to  the  student’s  prior  knowledge, 
and  would  test  their  generative  abilities  instead  of  their  recognition  abilities.  But  such  a 
scenario,  while  feasible,  would  require  a  large  amount  of  effort  to  produce  intelligent  testing 
systems that cover a large part of the curriculum. Computer-based teaching systems will in fact
be  developed  to  cover  much  of  the  school  curriculum  in  the  next  decade,  given  the  expansion  of 
tools  and  resources  that  is  taking  place  in  the  field.  Whether  these  will  be  extended  to  address 
testing  concerns  is  still  an  open  question,  however. 

An  integrated  learning  and  testing  environment.  The  second,  more  radical  scenario  for  an 
integrated  learning  and  testing  environment  is  implicit  in  the  way  Anderson  (this  volume)  has 
analyzed  students’  learning  in  the  LISP  tutor.  The  tutor  was  built  to  teach  students  LISP,  but  as 
a  side  effect  of  the  teaching,  Anderson  collected  a  record  of  their  performance  with  the  tutor  that 
he  could  analyze  to  test  various  hypotheses.  Each  analysis  is  a  slice  through  the  data  to  answer 
certain questions. He can look at students’ learning curves, error rates, response times, and even
factor  out  differences  between  their  ability  to  learn  versus  their  ability  to  remember.  That  is,  the 
computational medium enables evaluation to be carried out on the process of learning. Rather
than students stopping to take a test, the testing comes free in the course of the teaching.

This  view  of  testing  first  evolved  to  my  knowledge  in  a  cognitive  science  working  group  at 
a  conference  on  testing  (Tyler  and  White,  1980).  The  analogy  we  used  was  to  professional 
sports  like  baseball  or  football  where  extensive  records  are  kept  on  players  (by  scorekeeping  and 
videotaping),  so  that  their  performance  can  be  evaluated  from  different  perspectives:  e.g.,  in 
baseball,  the  batting  percentage  with  men  on  base,  the  number  of  runners  left  stranded  by  a 
player,  the  batting  percentage  against  left  handed  vs.  right  handed  pitchers,  etc.  Different 
statistics  are  used  to  make  different  decisions:  should  you  keep  the  player  or  send  him  to  the 
minors,  where  should  he  be  in  the  batting  order,  in  what  situations  should  he  be  used  as  a  pinch 
hitter,  what  should  he  practice,  etc.  In  sports  all  the  variety  of  questions  we  try  to  answer  on  the 
basis  of  tests  in  school  are  answered  on  the  basis  of  analysis  of  actual  performance  in  the  field. 
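
The idea of slicing one performance record from several perspectives can be sketched as follows. The record format and the sample at-bats are hypothetical, chosen only to show how a single log supports different statistics for different decisions.

```python
from typing import Callable, Dict, List, Optional

AtBat = Dict[str, object]  # hypothetical record of one plate appearance

def batting_average(
    at_bats: List[AtBat],
    condition: Optional[Callable[[AtBat], bool]] = None,
) -> Optional[float]:
    """Hits divided by at-bats, restricted to records satisfying condition."""
    selected = [ab for ab in at_bats if condition is None or condition(ab)]
    if not selected:
        return None  # no at-bats under this condition
    return sum(1 for ab in selected if ab["hit"]) / len(selected)

# Hypothetical log for one player.
at_bats = [
    {"hit": True, "pitcher": "left", "men_on_base": True},
    {"hit": False, "pitcher": "left", "men_on_base": False},
    {"hit": False, "pitcher": "right", "men_on_base": True},
    {"hit": False, "pitcher": "right", "men_on_base": False},
]

overall = batting_average(at_bats)
vs_left = batting_average(at_bats, lambda ab: ab["pitcher"] == "left")
vs_right = batting_average(at_bats, lambda ab: ab["pitcher"] == "right")
with_men_on = batting_average(at_bats, lambda ab: ab["men_on_base"])
print(overall, vs_left, vs_right, with_men_on)
```

The same pattern would apply to a record of student actions in a tutoring system: each question asked of the data is a different filter over one log of actual performance, rather than a separate test.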

In  this  scenario,  students  work  with  computers  either  in  groups  or  individually.  The 
teacher’s  role  is  that  of  coach  rather  than  instructor.  She  will  suggest  tasks  and  activities  for 
students  to  engage  in,  give  them  advice  or  help  when  they  need  it,  and  monitor  how  they  are 
progressing.  This  scenario  assumes  that  there  is  a  variety  of  good  educational  software,  as  well 
as  computational  tools  (e.g.  word  processors  and  writing  coaches,  statistical  and  graphing 
programs,  and  computer-based  laboratories).  The  students  would  spend  their  day  working  with 
different  programs,  for  example  using  the  LISP  Tutor  or  Quest,  doing  science  projects  with 
statistical  and  graphing  programs,  debating  with  students  in  other  schools  via  electronic 
networks,  etc.  The  computers  and  teachers  would  be  assistants  to  the  students’  self  learning. 

My  claim  is  that  in  this  environment  all  the  functions  of  testing  can  be  realized  without 
students  taking  tests  per  se,  and  that  all  the  desiderata  for  testing  can  be  achieved  without  the  bad 
side  effects  described.  There  are  three  kinds  of  measures  that  occur  in  this  scenario  that  can  be 
used  to  carry  out  the  multiple  functions  of  testing: 

1. Diagnosis. Diagnosis is distributed between computer, teacher, and students.
Many computer tutors, such as the LISP Tutor (Anderson, this volume), carry out
some form of diagnosis. In the case of the LISP Tutor the diagnosis is extremely
local; it only looks for specific errors students may make at each step and gives
advice accordingly. Other computer tutors, such as Sophie (Brown, Burton, &
de Kleer, 1982) and WEST (Burton & Brown, 1982), perform much more global
analyses of the students’ misunderstandings and errors. Frederiksen and White (this
volume)  suggest  providing  aids  so  that  students  can  do  self  diagnosis,  which  should 
prove  even  more  effective  than  computational  analysis  alone.  Finally,  the  teacher 
will  be  available  to  interact  with  students  on  a  one-to-one  basis  as  a  coach,  and 
hence  should  be  able  to  build  up  a  better  picture  of  the  difficulties  particular 
students  are  having  than  in  the  traditional  classroom. 

2.  Summary  Statistics.  As  Anderson  (this  volume)  has  done  with  the  LISP  Tutor,  it  is 
possible  to  keep  records  of  what  students  do  while  they  are  learning  and  analyze 
these  records  to  report  to  different  audiences  on  students’  progress.  For  example,  a 
report  to  administrators  might  summarize  how  many  students  went  all  the  way 
through  the  LISP  Tutor  and  the  Geometry  Tutor,  and  how  fast  they  went  through 
each.  A  report  to  parents  might  describe  what  tutoring  systems  their  child  worked 
with,  what  kind  of  progress  they  made,  how  hard  they  tried  (in  terms  of  how  long 
they  stuck  with  various  programs,  particularly  when  they  were  having  difficulty) 
and  any  other  measures  that  parents  request,  individually  or  collectively.  Reports 
to  teachers  might  summarize  the  kinds  of  difficulties  each  student  is  having,  and 
the  amount  each  student  learned  using  the  different  programs  (in  terms  of  the 
difference  between  their  scores  in  the  beginning  and  their  scores  in  the  final 
sessions).  In  fact,  teachers  could  be  given  the  capability  of  requesting  that  different 
kinds  of  analyses  be  made  on  the  data,  just  as  Anderson  did  with  the  LISP  Tutor. 
There  are  other  audiences  and  other  ways  of  analyzing  such  data,  but  these 
examples  suffice  to  show  what  might  be  done. 

3.  Portfolios.  Some  computer-based  teaching  systems  keep  a  library  of  students’  best 
work.  As  far  as  I  know,  the  idea  was  first  used  in  the  Plato  math  curriculum 
(Dugdale  and  Kibbey,  1975).  An  excellent  example  of  a  library  is  the  one  in  Green 
Globs,  a  game  to  teach  analytic  geometry  developed  by  Sharon  Dugdale.  The 
game,  which  is  a  part  of  a  larger  set  of  computer-based  activities  to  teach  analytic 
geometry,  requires  students  to  write  equations  for  curves  to  go  through  fifteen 
green  globs  placed  randomly  on  a  Cartesian  plane.  The  more  green  globs  any 
curve  goes  through  (each  glob  only  counts  once),  the  more  points  students  score 
for  that  curve  (the  nth  glob  scores  2^(n-1)  points).  In  the  library  are  stored  the  highest 
scoring  games  played,  showing  where  the  globs  were  placed  and  the  equations 
written  to  make  the  high  score.  The  name  of  the  player  who  scored  each  game  is 
also  listed,  thus  garnering  fame  for  a  good  performance. 

The  concept  of  the  library  can  be  extended  to  the  personal  portfolio  that  students 
keep  as  a  record  of  their  accomplishments.  The  portfolio  could  record  each  student’s 
best  compositions,  game  performances,  or  problem  solutions.  Art  schools  and 
architecture  firms  require  portfolios  to  help  them  determine  who  should  be 
admitted.  This  is  because  they  know  it  is  impossible  to  evaluate  the  creative  skills 
of  a  person  in  terms  of  standard  tests.  As  we  move  to  a  society  where  learning  and 
thinking  are  critical,  the  same  problems  arise  with  standard  tests  in  other  domains. 

So  by  basing  placement  decisions  at  least  in  part  upon  student  portfolios  (they  may 
also  be  based  in  part  on  the  summary  statistics  described  above),  those  decisions  will  take 
into  account  creativity  as  well  as  selectivity. 

Moreover,  basing  decisions  upon  accomplishments  rather  than  simply  on  measured 
aptitudes  reflects  more  realistically  the  way  decisions  are  made  in  the  real  world. 
We  value  employees  or  students  who  do  good  things,  not  those  who  merely  have 
the  capability  to  do  good  things.  By  stressing  accomplishment  in  our  decisions,  we 
change  the  motivation  structure  for  students  in  school.  The  emphasis  will  change 
from  a  concern  with  doing  well  on  tests  to  producing  good  works. 
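
The  Green  Globs  scoring  rule  and  library  described  under  Portfolios  can  be  illustrated  with  a  short  sketch.  It  assumes  the  nth  glob  hit  by  a  curve  scores  2^(n-1)  points  (a  doubling  rule:  1,  2,  4,  ...),  so  a  curve  through  k  globs  earns  2^k - 1  points  in  total.  The  names  curve_score,  game_score,  and  Library,  and  the  fixed  library  capacity,  are  hypothetical  and  chosen  only  for  illustration;  they  are  not  taken  from  the  Plato  software. 

```python
from __future__ import annotations

from dataclasses import dataclass, field


def curve_score(globs_hit: int) -> int:
    """Points for one curve, assuming the nth glob hit scores 2**(n-1).

    The sum 1 + 2 + 4 + ... + 2**(k-1) equals 2**k - 1.
    """
    if globs_hit < 0:
        raise ValueError("globs_hit must be non-negative")
    return 2 ** globs_hit - 1


def game_score(globs_per_curve: list[int]) -> int:
    """Total for a game: each curve is scored separately, then summed."""
    return sum(curve_score(k) for k in globs_per_curve)


@dataclass
class Library:
    """Hypothetical record of the highest-scoring games (assumed design)."""

    capacity: int = 10
    entries: list[tuple[int, str, list[str]]] = field(default_factory=list)

    def submit(self, player: str, equations: list[str],
               globs_per_curve: list[int]) -> bool:
        """Add a game; return True if it is kept among the top scores."""
        entry = (game_score(globs_per_curve), player, list(equations))
        self.entries.append(entry)
        # Highest score first; the stable sort keeps earlier games ahead on ties.
        self.entries.sort(key=lambda e: e[0], reverse=True)
        del self.entries[self.capacity:]
        return entry in self.entries
```

Under  this  rule  a  single  curve  through  five  globs  (31  points)  is  worth  far  more  than  five  curves  through  one  glob  each  (5  points),  which  is  what  rewards  students  for  finding  equations  that  pass  through  many  globs  at  once. 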

In  summary,  let  me  review  briefly  how  this  scenario  addresses  the  desiderata  and  concerns 
raised  in  the  first  two  sections.  The  scenario  entails  moving  away  from  testing  per  se  to  analysis 
of  the  ongoing  learning  and  accumulation  of  the  products  produced  by  that  learning.  There  is  no 
lowering  of  standards  to  what  can  be  measured,  nor  any  overemphasis  on  doing  well  on  tests. 
Moreover,  there  will  be  little  stigmatizing  of  students  for  their  poor  performance;  rather  they  will 
be  rewarded  for  good  products.  The  emphasis  on  learning  and  thinking  will  be  central. 
Furthermore,  the  three  kinds  of  measures  discussed  can  validly  serve  all  the  purposes  of  testing 
in  today’s  school.  The  scenario  describes  a  truly  integrated  learning  and  testing  environment. 
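
The  record  keeping  described  under  Summary  Statistics  can  be  sketched  in  a  few  lines.  The  sketch  below  assumes  each  tutoring  session  is  logged  with  a  student,  a  program,  the  minutes  spent,  and  a  score;  the  Session  record,  the  two-session  averaging  window,  and  the  function  names  are  all  hypothetical  and  do  not  describe  how  the  LISP  Tutor  actually  stores  or  analyzes  its  data.  Learning  is  estimated,  as  in  the  text,  by  the  difference  between  scores  in  the  beginning  sessions  and  scores  in  the  final  sessions. 

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    """One logged tutoring session (hypothetical record layout)."""

    student: str
    program: str
    minutes: float
    score: float


def learning_gain(scores, window=2):
    """Mean of the last `window` scores minus mean of the first `window`.

    Scores are assumed to be in chronological order. The window shrinks
    for short histories so the first and last groups never overlap.
    """
    if len(scores) < 2:
        return 0.0
    k = max(1, min(window, len(scores) // 2))
    return mean(scores[-k:]) - mean(scores[:k])


def teacher_report(sessions):
    """Summarize sessions, time on task, and gain per (student, program)."""
    grouped = defaultdict(list)
    for s in sessions:
        grouped[(s.student, s.program)].append(s)
    return {
        key: {
            "sessions": len(group),
            "minutes": sum(s.minutes for s in group),
            "gain": learning_gain([s.score for s in group]),
        }
        for key, group in grouped.items()
    }
```

The  same  log  could  feed  the  other  reports  mentioned:  totals  of  students  completing  each  program  for  administrators,  or  time  spent  per  program  as  a  rough  indicator  of  persistence  for  parents. 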

4.  Conclusion 


The  introduction  of  computers  into  our  education  system  provides  an  opportunity  to  rethink 
the  whole  relationship  between  testing  and  learning.  There  are  serious  problems  with  the  way 
testing  currently  drives  our  education  system:  it  fosters  emphasis  on  lower-order  rather  than 
higher-order  skills  and  encourages  stigmatization  of  students  who  do  not  do  well.  Further,  testing 
as  presently  construed  has  been  concerned  only  with  content  and  predictive  validity  rather  than  with 
validity  with  respect  to  the  many  other  purposes  of  testing.  But  by  repositioning  testing  in 
computer-based  learning  environments,  many  of  the  problems  with  testing  as  currently  construed 
can  be  alleviated. 